Abstract

The GANDALF 6U-VME64x/VXS module has been designed to cope with a variety of readout tasks in high energy
and nuclear physics experiments, in particular the COMPASS experiment at CERN. The exchangeable mezzanine cards
allow the system to be employed in very different applications such as analog-to-digital or time-to-digital
conversions, coincidence matrix formation, fast pattern recognition or fast trigger generation. Based on this platform,
we present a 128-channel TDC which is implemented in a single Xilinx Virtex-5 FPGA using a shifted clock sampling
method. In this concept each input signal is continuously sampled by 16 flip-flops using equidistant phase-shifted clocks.
Compared to previous FPGA designs, usually based on delay lines and comprising few TDC channels with resolutions
in the order of 10 ps, our design permits the implementation of a large number of TDC channels with a resolution of
64 ps in a single FPGA. Predictable placement of logic components and uniform routing inside the FPGA fabric is a
particular challenge of this design. We present measurement results for the time resolution and the nonlinearity of the
TDC readout system.
1. The GANDALF Framework

GANDALF [1, 2] is a 6U-VME64x/VXS carrier board which can host two custom mezzanine cards
(Fig. 1). It has been designed to cope with a variety of readout tasks in high energy and nuclear physics
experiments. Depending on the requirements of the desired application, the system can be equipped with
different types of mezzanine cards. Currently three types of mezzanine cards are available: 8-channel ADC
cards, 64-channel LVDS input cards and 64-channel LVDS output cards. Presently under development is a
card with high-speed optical interfaces for data transfer to/from remote detector frontend modules.

The mainboard comprises two Xilinx Virtex-5 FPGAs. The main FPGA (Virtex-5 SX95T) includes a
large number of DSP slices which are used for fast signal processing when GANDALF is operated with
ADC cards (Fig. 2, left) in transient-analyzer mode [3, 4]. For digital I/O applications the board is equipped
with the LVDS I/O cards (Fig. 2, right). The differential signals are routed directly to the main FPGA, where
the desired logic is implemented. Several applications like a 128-channel scaler, a 64-channel mean-timer
with subsequent coincidence matrix [5] and a pattern generator have been successfully implemented so far.

Fig. 1. Picture of the GANDALF carrier board equipped with two ADC mezzanine cards. The center mezzanine card hosts an
optical receiver for the COMPASS trigger and clock distribution system (TCS). The VME64x interface is used for configuration and
monitoring of the board, data is sent to the data acquisition system (DAQ) via the S-Link or the USB interface, and the VXS interface
is used for inter-board communication.

Fast and deep memory extensions of 144-Mbit QDRII+ and 4-Gbit DDR2 RAM are connected to a 
second FPGA (Virtex-5 LX30T). Both FPGAs are linked to each other by eight bidirectional high-speed 
Aurora lanes with a total bandwidth of 25 Gbit/s per direction. A dead-time free data output can be realized 
by dedicated backplane link cards, following the 160 MByte/s S-Link or the Ethernet protocol.
Alternatively, a data readout is possible by using the VME64x bus in block read mode or the USB 2.0 port on the
front panel. The VME and USB protocols are handled by a Xilinx CoolRunner-II CPLD. 

1.1. The LVDS Input and Output Cards 

The LVDS input card provides 64 differential inputs via two VHDCI (Very High Density Cable
Interconnect) connectors. The signals are transferred by differential buffers (ON Semiconductor
NB4N855S), converting signal levels and protecting the FPGA from short circuits and
electrostatic discharges. The jitter of the signal path including the buffers and the FPGA inputs is below
20 ps RMS. Additionally, one NIM input and two NIM outputs are available via LEMO connectors, e.g. for
gating or triggering purposes. With the same PCB an LVDS output card can also be assembled by different
placement of the components.

Fig. 2. Two possible configurations of the GANDALF module. Left (with ADC cards): transient-analyzer with real-time pulse shape
analysis and online feature extraction. Right (with LVDS input cards): 128-channel time-to-digital converter.

Fig. 3. Two possible TDC concepts for implementation in an FPGA: (a) delayed data sampling, (b) shifted clock sampling.

1.2. The Virtex-5 Architecture 

The Xilinx Virtex-5 is a powerful FPGA built on a 65-nm copper CMOS process technology [7]. The
SX95T contains 160 × 46 Configurable Logic Blocks (CLBs) which are made up of two slices each. Every
slice contains four function generators (6-input look-up tables), four storage elements (D-type flip-flops or 
latches), fast carry logic, large multiplexers and connections to a switch matrix to access general routing 
resources. Furthermore the FPGA contains 244 blocks of 36-Kbit RAM (configurable as dual-port RAM 
or FIFOs) and 640 DSP48E slices with a 25 × 18 two's complement multiplier and a 48-bit arithmetic
logic unit usable as adder, subtracter, accumulator or bitwise logic unit. Six clock management tiles with
digital clock managers and PLLs are available for input jitter filtering, frequency synthesis, clock division 
and clock phase shifting. The LX30T is a smaller FPGA model based on the same architecture, providing 
approx. 30% of the logic resources and 15% of the memory resources compared to the SX95T. 



2. The GANDALF Time-to-Digital Converter 

This section describes the implementation of 128 TDC channels inside the Virtex-5 FPGA located on 
the GANDALF board. The design objectives for this project were based on the requirements of high-rate 
particle physics experiments. The time resolution is required to be better than 100 ps RMS for precise 
tracking and time-of-flight measurements. The TDC has to be multi-hit capable with a deep hit-buffer and 
a programmable trigger window to select hits in the region of interest around a trigger. A dead-time free 
digitization has to be guaranteed even for bursts of many consecutive hits and triggers. 

2.1. TDC Concepts 

There are different concepts to implement a TDC in an FPGA. A trivial TDC would just sample the data
signal with one flip-flop, resulting in a TDC bin width of $1/f_{\mathrm{clk}}$. Since the clock frequency in an FPGA is
limited to $f_{\mathrm{clk}} \approx 500$ MHz, one has to subdivide the clock period to achieve the desired resolution. Fig. 3
shows two possible concepts: for the delayed data sampling (DDS, Fig. 3(a)) the input signal is routed
through a tapped delay line and the delayed signals from the taps are sampled by flip-flops with one common
clock. This results in a bit pattern depending on the propagation time of the signal through the delay line
until the next rising edge of the sampling clock. For the shifted clock sampling (SCS, Fig. 3(b)) the input
signal is routed with minimum skew to a number of flip-flops which are clocked by a set of $n$ equidistant
phase-shifted clocks clk($i$) with $i = 0, 1, \ldots, n - 1$.
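
As a rough worked example of why the clock period has to be subdivided, the bin width for $n$ equidistant sampling phases and the RMS quantization error of an ideal uniform bin can be written as below. The symbols $t_{\mathrm{bin}}$ and $\sigma_q$ are introduced here for illustration only and do not appear in the original text.

```latex
t_{\mathrm{bin}} = \frac{1}{n \, f_{\mathrm{clk}}},
\qquad
\sigma_q = \frac{t_{\mathrm{bin}}}{\sqrt{12}}
```

For a single flip-flop ($n = 1$) at $f_{\mathrm{clk}} = 500$ MHz this gives $t_{\mathrm{bin}} = 2$ ns and $\sigma_q \approx 577$ ps, well above the required 100 ps RMS. This idealized estimate ignores clock jitter and bin width variations.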

Both concepts have of course their pros and cons. While the DDS uses only one common clock, which
makes it easy to further process the sampled data, the SCS starts from a set of different clock domains which
have to be synchronized first. The main drawback of the DDS is the allocation of the delay elements in
an FPGA. Various routing resources with different propagation delays (like the carry lines or the general
routing matrix) are available but the delays are non-uniform, so every TDC channel has to be calibrated.
With the dedicated carry-chains, high-resolution TDCs have been implemented in FPGAs so far, but the
logic consumption for 128 TDC channels would exceed by far the device resources.

Fig. 4. Signal timing diagram of the 8-bin shifted clock sampling TDC. For a detailed explanation see Section 2.2.

2.2. Shifted Clock Sampling 

The implementation presented in this article is based on the SCS method. Fig. 4 shows a simplified
signal timing diagram of an 8-bin TDC. The first line represents the data input signal with a hit (rising
edge). The 8 TDC clocks clk($i$) are equally phase-shifted by $\Delta\varphi(i) = i \cdot 2\pi/8$. They clock the 8 TDC
flip-flops which are all connected to the same data input signal. The flip-flop with a rising clock edge right after
the hit (in this example the next-to-last one) is the first to sample the new value ('1'). The other flip-flops
are following shortly after. Once every clock period, the values from all flip-flops are copied to an output
register. The hit searching algorithm tests the output register for bit patterns other than "00000000" or "11111111".
If a pattern with a 'bitswap' (change from 0s to 1s, or from 1s to 0s) is found, the bitswap position together
with the value of the clock counter contains the time information of the hit. The hit searching algorithm can
be configured to be leading and/or trailing edge sensitive.
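
The hit search described above can be sketched in software as follows. This is only an illustrative model, not the FPGA logic itself: the function names, the bit ordering (index 0 is the flip-flop clocked by clk(0)) and the `prev_last` argument, which carries the last sample of the previous clock period so that a level change at the period boundary is also seen, are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def find_bitswaps(register: Sequence[int], prev_last: int = 0,
                  leading: bool = True, trailing: bool = False) -> List[Tuple[int, str]]:
    """Return (bin_index, edge) for every level change in one output register word.

    register  -- samples (0 or 1) ordered by clock index i = 0 .. n-1
    prev_last -- last sample of the previous clock period (assumption of this sketch)
    """
    hits = []
    previous = prev_last
    for index, bit in enumerate(register):
        if bit != previous:
            edge = "leading" if bit == 1 else "trailing"
            if (edge == "leading" and leading) or (edge == "trailing" and trailing):
                hits.append((index, edge))
        previous = bit
    return hits


def hit_time(coarse_count: int, bin_index: int, n_bins: int, clock_period: float) -> float:
    """Combine the clock counter value and the bitswap position into one timestamp."""
    return (coarse_count + bin_index / n_bins) * clock_period
```

For the situation drawn in Fig. 4, where the next-to-last flip-flop is the first to see the new level, `find_bitswaps([0, 0, 0, 0, 0, 0, 1, 1])` reports a leading edge at bin index 6.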

Due to setup & hold requirements the TDC flip-flops cannot be read out simultaneously as shown
in Fig. 4. To synchronize the flip-flop outputs, the different clock domains are merged in a two-stage process
(Fig. 5). Two 'partitions' are introduced, each reading half of the flip-flops. Because a bitswap can only be
detected within a partition, the flip-flops at the partition borders (no. 0 and 4 in the figure) are read from
both partitions to avoid the loss of hits that might occur on these borders.

For the sake of clarity the example above describes an 8-bin TDC; however, the final design (Fig. 6)
uses 16 TDC flip-flops, hence dividing the bin width by another factor of two. The 8 TDC clocks clk($i$)
($i = 0, \ldots, 7$) are phase-shifted by $\Delta\varphi(i) = i \cdot 2\pi/16$, therefore spanning half a clock period. The first 8
flip-flops are rising-edge triggered, the others are falling-edge triggered by locally inverting the clocks in
the corresponding slices. The synchronization of the clock domains is performed by using 4 partitions.
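
The phase arithmetic of the 16-bin arrangement can be checked with a few lines of Python. The sketch below assumes ideal clocks with 50% duty cycle and neglects the delay of the local inverters; the function name is hypothetical. Sampling instants are expressed as fractions of one clock period.

```python
from fractions import Fraction
from typing import List


def sampling_instants(n_clocks: int = 8) -> List[Fraction]:
    """Sampling instants of all flip-flops, in units of the clock period.

    Clock i is shifted by i/(2*n_clocks) of a period. Its rising edge drives
    one flip-flop and its locally inverted copy (falling edge, half a period
    later) drives a second one.
    """
    n_bins = 2 * n_clocks
    rising = [Fraction(i, n_bins) for i in range(n_clocks)]
    falling = [instant + Fraction(1, 2) for instant in rising]
    return rising + falling


instants = sampling_instants()
# Distinct spacings between neighbouring instants; one value means uniform bins.
spacings = {later - earlier for earlier, later in zip(instants, instants[1:])}
```

With 8 clocks the 16 instants fall on multiples of 1/16 of the clock period, so the rising-edge and falling-edge flip-flops together cover the full period with uniform nominal bins.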
Fig. 5. Readout of the 8-bin TDC. To meet the setup & hold requirements, two partitions are introduced. Half of the TDC flip-flops
are read out at a time.




Fig. 6. The 16-bin TDC design is based on 4 partitions. The squares illustrate the TDC flip-flops and the numbers denote the index $i$
of the corresponding clock. Flip-flops that are triggered by a rising clock edge are drawn in blue, falling edge triggered flip-flops are
drawn in green.



2.3. 128-Channel TDC Design 

The 128-channel TDC design is segmented into 16 identical blocks of 8 channels each, the so-called
'F1-blocks' (Fig. 7). This is done to ease the data collection process by using a two-step procedure. The
timestamps of the detected hits are stored in a 1k deep hit buffer per channel. The timestamps of incoming
triggers are buffered in a trigger FIFO, until they are processed by the trigger matching unit. This algorithm
combines 8 channels at a time by selecting the hits from the respective hit buffers that fall into the trigger
window and writing them to the output FIFO of the F1-block. Old hits beyond the trigger latency are deleted
from the hit buffers. In a last step the data from all 16 F1-blocks are collected and sent to the DAQ using
the S-Link interface. Thanks to the segmentation into F1-blocks, it was possible to use the same data output
format as the existing hardware based on the F1 TDC chip [8, 9].
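
A minimal software model of the trigger matching step might look as follows. It is a sketch under several assumptions that are not taken from the original design: timestamps are plain integers in TDC bins without counter wrap-around, each hit buffer is sorted in ascending order, matched hits stay in the buffer, and the parameter names (`window_offset`, `window_width`, `max_latency`) are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple


def match_trigger(hit_buffers: List[Deque[int]], trigger_time: int,
                  window_offset: int, window_width: int,
                  max_latency: int) -> List[Tuple[int, int]]:
    """Select hits inside the trigger window and drop hits that are too old.

    Returns (channel, hit_time - trigger_time) for every hit with
    trigger_time + window_offset <= hit_time < trigger_time + window_offset + window_width.
    """
    window_start = trigger_time + window_offset
    window_end = window_start + window_width
    matched = []
    for channel, hits in enumerate(hit_buffers):
        # Delete old hits beyond the trigger latency.
        while hits and hits[0] < trigger_time - max_latency:
            hits.popleft()
        # Buffers are time-ordered, so the scan can stop at the window end.
        for hit_time in hits:
            if hit_time >= window_end:
                break
            if hit_time >= window_start:
                matched.append((channel, hit_time - trigger_time))
    return matched


buffers = [deque([5, 95, 102]), deque([99, 130])]
selected = match_trigger(buffers, trigger_time=100, window_offset=-10,
                         window_width=20, max_latency=50)
```

In this toy example the hit at time 5 is discarded as too old, the hit at 130 lies after the window, and the remaining three hits are written out relative to the trigger time.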

2.4. FPGA Implementation 

To achieve good linearity in the digitization process, the TDC bin widths have to be as uniform as
possible. The main contributions to the bin width variations are the clock phase error and the routing
skew of the data signal to the different flip-flops. The first is very well controlled by the operation of
clocking resources available in the FPGA. Two PLLs are used to generate 8 phase-shifted clocks that are
distributed over the FPGA via global clock nets. For the 16-bin TDC design, each clock is inverted locally
inside the slices to generate 8 additional clocks. The routing skew is more difficult to control, because the
FPGA implementation tools have no handle to influence the router to choose certain connections. Timing
constraints that are available for traditional FPGA logic only ask for a maximum delay but not for a certain
value. Hence, the placement of the TDC flip-flops was controlled by user-defined scripts in a way that the
auto-router inevitably finds appropriate connections.

Fig. 7. An F1-block combines 8 TDC channels into one common output FIFO.

The design was floorplanned by defining area constraints for every F1-block and fixing their positions,
to support the place and route process. The implementation was carried out separately for each F1-block and
the results were saved as design partitions. These partitions were imported in the final implementation run
where the remaining data merging and interface logic was added. The final design uses 43% of the flip-flops
and 27% of the LUTs available in the Virtex-5 SX95T.
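
To put these percentages into absolute numbers, the short sketch below derives the SX95T totals from the CLB array quoted in Section 1.2 (160 × 46 CLBs, two slices per CLB, four LUTs and four storage elements per slice). The helper names are hypothetical, and since the utilization figures are rounded percentages the absolute counts are only approximate.

```python
CLB_ARRAY = (160, 46)      # CLB array of the SX95T as quoted in Section 1.2
SLICES_PER_CLB = 2
LUTS_PER_SLICE = 4
FLIP_FLOPS_PER_SLICE = 4


def sx95t_totals() -> dict:
    """Total slices, LUTs and flip-flops implied by the CLB array."""
    slices = CLB_ARRAY[0] * CLB_ARRAY[1] * SLICES_PER_CLB
    return {
        "slices": slices,
        "luts": slices * LUTS_PER_SLICE,
        "flip_flops": slices * FLIP_FLOPS_PER_SLICE,
    }


def approx_used(total: int, percent: float) -> int:
    """Approximate absolute resource count for a rounded utilization figure."""
    return round(total * percent / 100)


totals = sx95t_totals()
used_flip_flops = approx_used(totals["flip_flops"], 43)
used_luts = approx_used(totals["luts"], 27)
```

Under these assumptions the 128-channel design occupies roughly 25,000 of the 58,880 flip-flops and roughly 16,000 of the 58,880 LUTs.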



3. Measurement Results 

To characterize the time-to-digital converter we developed a pattern generator using the GANDALF
hardware with LVDS output cards. It generates test pulses for 128 channels with variable delay and
repetition rate. A test setup with 2 pattern generators, 2 GANDALF TDC modules, a trigger control system
(TCS) [10] and a DAQ with S-Link readout was installed. For the measurements we used a clock frequency
of 388.8 MHz, which results in a TDC bin width of 160 ps.

The differential nonlinearity (DNL) is a measure for the deviation of the TDC bin width from the nominal
value. It was determined using code density tests with random pulses. Fig. 8 shows, as an example, the result
of the measurement for one channel. The time resolution was determined by measuring the delay between
two channels for a large number of events. Fig. 9 shows the RMS of the delay measurement for all channels.
The RMS was divided by $\sqrt{2}$ to obtain the resolution for a single time measurement. This results in a TDC
resolution better than $0.56 \cdot 160\,\mathrm{ps} / \sqrt{2} \approx 64\,\mathrm{ps}$.
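
The arithmetic behind these numbers, together with a generic code density evaluation, can be sketched as follows. This is not the analysis code used for the measurements. It assumes that the random pulses are uncorrelated with the TDC clock, so that the number of hits collected in a bin is proportional to its width; all function names are hypothetical.

```python
import math
from typing import List, Sequence


def bin_width_ps(clock_mhz: float, n_bins: int) -> float:
    """Nominal bin width in ps when one clock period is divided into n_bins."""
    return 1e6 / (clock_mhz * n_bins)


def dnl_from_code_density(bin_counts: Sequence[int]) -> List[float]:
    """DNL per bin in LSB: relative deviation of each bin's count from the mean."""
    expected = sum(bin_counts) / len(bin_counts)
    return [count / expected - 1.0 for count in bin_counts]


def single_measurement_resolution(rms_delay_lsb: float, lsb_ps: float) -> float:
    """Resolution of one time measurement from the RMS of a two-channel delay."""
    return rms_delay_lsb * lsb_ps / math.sqrt(2)
```

With the values quoted in the text, `bin_width_ps(388.8, 16)` is about 160.8 ps and `single_measurement_resolution(0.56, 160)` is about 63.4 ps, consistent with the quoted resolution of 64 ps.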
Fig. 8. Differential nonlinearity (DNL) of an exemplary TDC channel. The other channels show similar values.

Fig. 9. RMS of the delay measurement vs. channel number. The RMS is below 0.56 LSB for all channels. For details see text.



4. Conclusion and Outlook 

A 128-channel time-to-digital converter based on the shifted clock sampling method has successfully
been implemented in a single Virtex-5 FPGA on the GANDALF module. The TDC base clock has a
frequency of 388.8 MHz and is divided into 16 TDC bins of 160 ps each. The time resolution has been
determined to be 64 ps. With 43% of the available flip-flops and 27% of the available LUTs the device utilization
of the current design is quite moderate, which allows for future extensions. At the moment work is
ongoing to integrate 128 scaler channels into the same design for simultaneous rate measurements. Inter-board
communication via the VXS interface is planned for fast trigger decisions.